Three Algorithms for Cholesky Factorization on Distributed Memory Using Packed Storage

Authors

  • Fred G. Gustavson
  • Lars Karlsson
  • Bo Kågström
Abstract

We present three algorithms for Cholesky factorization using minimum block storage in a distributed memory (DM) environment. One of the distributed square blocked packed (SBP) format algorithms performs similarly to ScaLAPACK PDPOTRF, and with iteration overlapping it outperforms PDPOTRF by as much as 67%. By storing the blocks in a standard contiguous way, we obtain better-performing BLAS operations. Our DM algorithms are almost insensitive to memory hierarchy effects and thus give smooth and predictable performance. We also investigate the intricacies of using rectangular full packed (RFP) format in a DM ScaLAPACK environment and point out some of its advantages and drawbacks.

1 Near Minimal Storage in a Serial Environment

Rectangular full packed (RFP) format is a standard full-storage two-dimensional array for triangular or symmetric matrices that requires minimum storage [3]. For the lower triangular case, the blocks A11, A21, and A22 are stored as submatrices in a rectangular full-storage array. This allows level 3 BLAS to be used and makes it easy to write LAPACK-style code for this format [3]. SBP format is a generalization of standard full storage. The matrix is partitioned into square blocks of order NB, and when a symmetric or triangular matrix is stored, only the blocks of one triangle are kept. Each square block is contiguous in memory, and the blocks are stored either row-wise or column-wise (a small indexing sketch is given below, after Section 2). Each square diagonal block wastes NB(NB-1)/2 elements, for a total of N(NB-1)/2 elements summed over all N/NB diagonal blocks. Each square block maps into the L1 cache in an optimal way, resulting in efficient BLAS operations.

2 Minimum Block Storage in a Distributed Environment

The current industry standard for distributed memory computing views the processors as a PxQ mesh and uses a 2D block cyclic layout (BCL) of full-format arrays. This has proven to be a good choice for achieving effective load balancing. However, it wastes about half the storage for triangular and symmetric matrices. There is currently no industry standard for packed storage; the SBP storage of Section 1 is one possibility.
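To make the SBP layout concrete, the following C sketch shows one plausible lower-triangular SBP indexing scheme. The ordering of blocks (by block column) and the zero-based element-to-offset mapping are illustrative assumptions made here, not necessarily the exact layout used by the algorithms in this paper.

#include <stdio.h>

/* Sketch of a lower square blocked packed (SBP) layout.
 * Assumption (illustrative, not from the paper): whole blocks are laid
 * out block column by block column, and each NB x NB block is itself a
 * contiguous column-major buffer; diagonal blocks keep their unused
 * strictly upper triangle, which is the NB(NB-1)/2 waste per diagonal
 * block mentioned in Section 1. */

/* Doubles needed to hold the lower triangle of an order-n matrix. */
static long sbp_storage(long n, long nb) {
    long nblk = (n + nb - 1) / nb;          /* blocks per dimension    */
    long nlow = nblk * (nblk + 1) / 2;      /* lower-triangular blocks */
    return nlow * nb * nb;
}

/* Offset of element (i, j), with i >= j, inside the SBP buffer. */
static long sbp_offset(long i, long j, long n, long nb) {
    long nblk = (n + nb - 1) / nb;
    long bi = i / nb, bj = j / nb;                 /* block row and column      */
    long before = bj * nblk - bj * (bj - 1) / 2;   /* blocks in columns < bj    */
    long blkidx = before + (bi - bj);              /* position within column bj */
    return blkidx * nb * nb + (j % nb) * nb + (i % nb);
}

int main(void) {
    long n = 1000, nb = 100;
    printf("SBP storage: %ld doubles (full: %ld, packed: %ld)\n",
           sbp_storage(n, nb), n * n, n * (n + 1) / 2);
    printf("offset of A(250,130): %ld\n", sbp_offset(250, 130, n, nb));
    return 0;
}

For N = 1000 and NB = 100 this stores 550,000 doubles against 500,500 for true packed storage, i.e. exactly the N(NB-1)/2 = 49,500 wasted elements noted in Section 1, while every block stays contiguous for level 3 BLAS calls.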

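The 2D block cyclic layout of Section 2 reduces, at the block level, to a simple ownership rule: block (bi, bj) lives on process (bi mod P, bj mod Q) of the PxQ mesh. The small C sketch below illustrates this rule for the lower-triangular block set; the mesh size and zero-based indexing are illustrative assumptions, and the sketch is not ScaLAPACK code.

#include <stdio.h>

/* Standard 2D block cyclic layout (BCL) ownership rule: block (bi, bj)
 * of the NB x NB partitioning is owned by process (bi mod P, bj mod Q)
 * on a P x Q process mesh. */
static void bcl_owner(long bi, long bj, int p, int q, int *pr, int *pc) {
    *pr = (int)(bi % p);   /* process row    */
    *pc = (int)(bj % q);   /* process column */
}

int main(void) {
    int p = 2, q = 3, pr, pc;
    /* For a lower triangular matrix only blocks with bi >= bj are stored,
     * and cycling them over the mesh keeps the load reasonably balanced. */
    for (long bi = 0; bi < 4; ++bi) {
        for (long bj = 0; bj <= bi; ++bj) {
            bcl_owner(bi, bj, p, q, &pr, &pc);
            printf("block (%ld,%ld) -> process (%d,%d)\n", bi, bj, pr, pc);
        }
    }
    return 0;
}

Storing only the lower-triangular blocks on their owning processes is what lets a distributed SBP format keep roughly the factor-of-two storage saving over a full-format BCL array.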

Similar articles

Optimizing Locality of Reference in Cholesky Algorithms

This paper presents the principal ideas involved in hierarchical blocking, introduces the block packed storage scheme, and gives the implementation details and performance rates of the hierarchically blocked Cholesky factorization. In some cases the newly developed routines are faster by an order of magnitude than the corresponding LAPACK routines. Introduction: Most current computers based ...

A distributed packed storage for large dense parallel in-core calculations

In this paper we propose a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency of full storage for a wide range of operations. This work has been motivated by the fact that, contrary to sequential linear algebra libraries (e.g. LAPACK [4]), there ...

High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage

We present a high performance Cholesky factorization algorithm, called BPC for Blocked Packed Cholesky, which performs better than or comparably to the LAPACK DPOTRF subroutine, but with about the same memory requirements as the LAPACK DPPTRF subroutine, which runs at level 2 BLAS speed. Algorithm BPC calls only DGEMM and level 3 kernel routines. It combines a recursive algorithm with blocking and ...

LAPACK Working Note ?: LAPACK Block Factorization Algorithms on the Intel iPSC/860

The aim of this project is to implement the basic factorization routines for solving linear systems of equations and least squares problems from LAPACK, namely the blocked versions of LU with partial pivoting, QR, and Cholesky, on a distributed-memory machine. We discuss our implementation of each of the algorithms and the results we obtained using varying matrix orders and block sizes.

Efficient Methods for Out-of-Core Sparse Cholesky Factorization

We consider the problem of sparse Cholesky factorization with limited main memory. The goal is to efficiently factor matrices whose Cholesky factors essentially fill the available disk storage, using very little memory (as little as 16 Mbytes). This would enable very large industrial problems to be solved with workstations of very modest cost. We consider three candidate algorithms. Each is based on ...



Publication year: 2006